Multilingual Statistical Text Analysis, Zipf's Law and Hungarian Speech Generation
Abstract
The practical challenge of creating a Hungarian e-mail reader initiated our work on statistical text analysis. The starting point was statistical analysis for automatic discrimination of the language of texts. Later it was extended to automatic re-generation of diacritic signs and to more detailed analysis of language structure. A parallel study of three different languages (Hungarian, German and English) using text corpora of similar size explores both similarities and differences. Corpora of publicly available Internet sources were used. The corpus size was the same (approximately 20 Mbytes, 2.5-3.5 million word forms) for all languages. Besides traditional corpus coverage, word length and occurrence statistics, some new features about prosodic boundaries (sentence-initial and sentence-final positions, and positions preceding and following a comma) were also computed. Among other results, it was found that the coverage of the corpora by the most frequent words follows a parallel logarithmic rule for all languages in the 40-85% coverage range, known as Zipf's law in linguistics. The functions for English and German are much closer to each other than to the Hungarian one. Further conclusions are also drawn. The language detection and diacritic re-generation applications are discussed in detail, with implications for Hungarian speech generation. Diverse further application domains, such as predictive text input, word hyphenation, language modeling in speech recognition and corpus-based speech synthesis, are also foreseen.
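The coverage statistic mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer (lowercased runs of letters) and the function names `coverage_curve` and `words_needed` are assumptions made only for this example. The sketch computes the share of all word tokens covered by the k most frequent word forms, which is the quantity whose growth the abstract describes as roughly logarithmic in the 40-85% range.

```python
from collections import Counter
import re


def coverage_curve(text):
    """Cumulative corpus coverage (0..1) by the k most frequent word forms.

    Element k-1 of the result is the fraction of all word tokens
    accounted for by the k most frequent word forms.
    """
    # Assumed tokenizer: lowercased runs of Unicode letters (no digits/underscore).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    curve = []
    covered = 0
    for _, freq in counts.most_common():
        covered += freq
        curve.append(covered / total)
    return curve


def words_needed(curve, target):
    """Smallest number of most-frequent word forms reaching the target coverage."""
    for k, coverage in enumerate(curve, start=1):
        if coverage >= target:
            return k
    return None  # target not reachable (e.g. empty corpus)


if __name__ == "__main__":
    toy_corpus = "a a a a b b b c c d"
    curve = coverage_curve(toy_corpus)
    print(curve)                      # [0.4, 0.7, 0.9, 1.0]
    print(words_needed(curve, 0.85))  # 3
```

On a real corpus, plotting the curve against the logarithm of k for each language would show whether the curves run parallel in the 40-85% range, as the abstract reports.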
Similar resources
Zipf's law and the creation of musical context
This article discusses the extension of the notion of context from linguistics to the domain of music. In language, the statistical regularity known as Zipf's law, which concerns the frequency of usage of different words, has been quantitatively related to the process of text generation. This connection is established by Simon's model, on the basis of a few assumptions regarding the accompanyin...
Explaining Zipf's Law via Mental Lexicon
Zipf's law is the major regularity of statistical linguistics that has served as a prototype for rank-frequency relations and scaling laws in natural sciences. Here we show that Zipf's law, together with its applicability for a single text and its generalizations to high and low frequencies including hapax legomena, can be derived from assuming that the words are drawn into the text with random p...
Word unit based multilingual comparative analysis of text corpora
A parallel study of three very different languages (Hungarian, German and English) using text corpora of similar size makes it possible to explore both similarities and differences. Corpora of publicly available Internet sources were used. The corpus size was the same (approx. 20 Mbytes, 2.5-3.5 million word forms) for all languages. Besides traditional corpus coverage, word length and o...
Parsing hungarian sentences in order to determine their prosodic structures in a multilingual TTS system
Natural-sounding synthesized speech requires proper prosodic structure. The unequivocal relation between syntax and prosody is contestable, but for lack of other information on discourse structure, we have to rely on syntactic structure in order to determine some prosodic features. This work, based on basic research results in Hungarian linguistics, started with a preliminary parser for sim...
A Multilingual Evaluation of Three Spelling Normalisation Methods for Historical Text
We present a multilingual evaluation of approaches for spelling normalisation of historical text based on data from five languages: English, German, Hungarian, Icelandic, and Swedish. Three different normalisation methods are evaluated: a simplistic filtering model, a Levenshtein-based approach, and a character-based statistical machine translation approach. The evaluation shows that the machine...